Predicting Survival in AML Patients from RNASeq Data using SVM

[redacted]

Introduction

Acute Myeloid Leukemia (AML)

  • A cancer of the blood and bone marrow that affects the myeloid cells.
  • Usually very aggressive with limited therapeutic options.
  • Only ~20% of patients achieve durable remission (Pulte et al. 2016).
  • Prognosis is usually determined by cytogenetic and molecular markers. They are important factors in determining the appropriate treatment plan.

Dataset & Objectives

As part of the Beat AML 2.0 program, a dataset consisting of clinical outcomes and genomic and transcriptomic data was collected from a cohort of 805 AML patients (Bottomly et al. 2022).

Of these, 571 patients have RNASeq data and either survived for at least one year after diagnosis (n = 311) or have died (n = 260). We propose a methodology to predict the one-year survival of AML patients using a support vector machine (SVM) model trained on the transcriptomic (RNASeq) data of these patients.

Methodology

Data Preprocessing and Normalization

  • The raw read count data was normalized into z-scores across the features.
  • The data was then split into training and testing sets with an 80/20 ratio.
  • The RNASeq data was then joined with the clinical data to form the final dataset.
# A tibble: 571 × 51,017
   survived ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
   <lgl>              <dbl>           <dbl>           <dbl>           <dbl>
 1 TRUE              -0.434          -0.123         -0.526          -0.330 
 2 TRUE              -0.322          -0.123         -1.10            0.900 
 3 TRUE              -0.479          -0.123          0.0255         -0.737 
 4 TRUE              -0.299          -0.123         -0.452          -0.336 
 5 TRUE               0.240          -0.123         -0.992           0.696 
 6 FALSE             -0.434          -0.123         -1.25           -1.22  
 7 TRUE               0.689          -0.123         -0.757          -0.586 
 8 TRUE              -0.389          -0.123         -0.927           0.0561
 9 TRUE              -0.479          -0.123         -1.09            0.582 
10 FALSE             -0.479          -0.123         -0.755          -0.796 
# ℹ 561 more rows
# ℹ 51,012 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>,
#   ENSG00000000971 <dbl>, ENSG00000001036 <dbl>, ENSG00000001084 <dbl>,
#   ENSG00000001167 <dbl>, ENSG00000001460 <dbl>, ENSG00000001461 <dbl>,
#   ENSG00000001497 <dbl>, ENSG00000001561 <dbl>, ENSG00000001617 <dbl>,
#   ENSG00000001626 <dbl>, ENSG00000001629 <dbl>, ENSG00000001630 <dbl>,
#   ENSG00000001631 <dbl>, ENSG00000002016 <dbl>, ENSG00000002079 <dbl>, …
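
The normalization and split steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for these slides: it assumes the z-scores are computed per gene (per column) across samples, and the function names, toy counts, and seed are hypothetical.

```python
import random
import statistics

def zscore_columns(rows):
    """Standardize each column (feature) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; constant columns get 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled (train_indices, test_indices) for an 80/20-style split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Toy read counts: 5 samples x 2 genes (made-up numbers, not Beat AML data).
counts = [[10, 200], [12, 180], [8, 220], [15, 210], [5, 190]]
z = zscore_columns(counts)

# Split the 571 patients by index.
train_idx, test_idx = train_test_split(571)
```

With 571 samples and a 20% test fraction, this sketch puts 457 samples in the training set and 114 in the test set.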

Feature Selection

  • We used the Boruta algorithm (Kursa and Rudnicki 2010) to select the most important features.
  • A brief explanation of the Boruta algorithm:
    • It first creates shadow features by randomly permuting the original features.
    • It then fits a random forest model using both the original and shadow features.
    • If a feature's importance Z-score is significantly higher than the best Z-score among the shadow features, it is considered important.
    • If a feature repeatedly scores no better than the shadow features, it is considered unimportant and removed from further iterations.
    • The algorithm repeats this until every feature has been classified as important or unimportant, or the iteration limit is reached.
  • We ran the Boruta algorithm for 1000 iterations and reduced the 51,016 features to 60 important features.
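
The shadow-feature idea can be illustrated with a much simplified sketch. Note the substitutions: instead of random forest importance Z-scores it uses the absolute Pearson correlation with the label, and instead of Boruta's statistical test it applies a plain hit-rate threshold. All names, the toy data, and the 0.9 threshold are hypothetical.

```python
import math
import random

def abs_pearson(xs, ys):
    """Absolute Pearson correlation; a stand-in for random forest importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def shadow_feature_hits(columns, labels, n_iter=50, seed=0):
    """Count how often each feature beats the best shadow (permuted) feature."""
    rng = random.Random(seed)
    hits = [0] * len(columns)
    for _ in range(n_iter):
        # Shadow features: each original column with its values randomly permuted.
        shadows = []
        for col in columns:
            shadow = list(col)
            rng.shuffle(shadow)
            shadows.append(shadow)
        best_shadow = max(abs_pearson(s, labels) for s in shadows)
        for j, col in enumerate(columns):
            if abs_pearson(col, labels) > best_shadow:
                hits[j] += 1
    return hits

# Toy data: feature 0 copies the label, feature 1 is pure noise.
rng = random.Random(1)
labels = [i % 2 for i in range(40)]
informative = [float(y) for y in labels]
noise = [rng.random() for _ in labels]

hits = shadow_feature_hits([informative, noise], labels)
# Keep features that beat the best shadow in at least 90% of iterations.
selected = [j for j, h in enumerate(hits) if h / 50 >= 0.9]
```

The real algorithm also drops rejected features between iterations, which this sketch omits for brevity.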

Model Evaluation

  • 80/20 split for training and testing.
  • We care about all 4 quadrants of the confusion matrix!
  • Matthews Correlation Coefficient (MCC) is the primary metric for evaluating the performance of the SVM model.
  • \[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]
  • MCC ranges from -1 to 1, where 1 is a perfect prediction, 0 is no better than a random prediction, and -1 is a perfect inverse prediction.
  • We also used the ROC curve to visualize some of the results.
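
As a quick check of the MCC formula above, here is a small sketch that computes it from the four confusion-matrix counts. Returning 0 when the denominator is zero is a common convention and an assumption here; the counts are made up.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc(tp=50, tn=50, fp=0, fn=0)    # every prediction correct
inverse = mcc(tp=0, tn=0, fp=50, fn=50)    # every prediction wrong
chance = mcc(tp=25, tn=25, fp=25, fn=25)   # predictions unrelated to labels
```

The three toy cases evaluate to 1, -1, and 0 respectively, matching the endpoints and midpoint of the MCC range.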

SVM - First Try

  • Short introduction to SVM:
    • A supervised machine learning algorithm that can be used for classification or regression tasks.
    • It finds the hyperplane that best separates the data into different classes.
    • It can be used for both linear and non-linear classification tasks using the kernel trick.
  • We fit default SVM models with a linear kernel, using either all features or only the 60 important features.

ROC Curve for bare-bones SVM

SVM - Non-linear Models

We used two non-linear SVM models:

  • The Gaussian kernel: \[ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\epsilon}\right) \]

  • The polynomial kernel: \[ K(x, x') = (\langle x, x' \rangle + C)^d \]
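
The two kernel formulas can be written directly as functions. This is an illustrative sketch on plain Python lists, not the library implementation; the parameter names `eps`, `c`, and `d` mirror \(\epsilon\), \(C\), and \(d\) in the formulas, and the example vectors are made up.

```python
import math

def gaussian_kernel(x, y, eps):
    """K(x, x') = exp(-||x - x'||^2 / eps)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / eps)

def polynomial_kernel(x, y, c, d):
    """K(x, x') = (<x, x'> + C)^d."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** d

u, v = [1.0, 2.0], [3.0, 4.0]
same = gaussian_kernel(u, u, eps=2.0)         # identical points give 1.0
far = gaussian_kernel(u, v, eps=2.0)          # decays with squared distance
poly = polynomial_kernel(u, v, c=1.0, d=2)    # (11 + 1)^2 = 144.0
```

A larger `eps` makes the Gaussian kernel decay more slowly with distance, so distant samples still count as similar.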

MCC performance for different kernels

SVM - Hyperparameter Tuning

\[ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\epsilon}\right) \]

In addition to the \(\epsilon\) value of the Gaussian kernel, we also need to tune the \(c\) (cost) value of the SVM algorithm.

  • A low \(c\) value allows a wider margin but tolerates more misclassified training samples.
  • A high \(c\) value penalizes misclassification more heavily, giving a narrower margin and fewer misclassified training samples.
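
The role of \(c\) is easiest to see in the standard soft-margin SVM objective. This is the textbook formulation, and it is an assumption that the library used here follows it exactly. The slack variables \(\xi_i\), the labels \(y_i \in \{-1, 1\}\), and the feature map \(\phi\) implied by the kernel are symbols introduced for this illustration.

\[ \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( \langle w, \phi(x_i) \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \]

The first term favors a wide margin and the second term charges \(c\) for each unit of margin violation, so raising \(c\) shifts the balance toward fitting the training samples.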

The Gaussian Kernel

MCC performance for different hyperparameters

Results

SVM - Variable Importance

  • We used a technique called permutation importance to determine the importance of each feature in the SVM model.
# A tibble: 60 × 5
  feature         delta_mcc display_label description                    biotype
  <chr>               <dbl> <chr>         <chr>                          <chr>  
1 ENSG00000198933   -0.109  TBKBP1        TBK1 binding protein 1 [Sourc… protei…
2 ENSG00000226419   -0.0915 SLC16A1-AS1   SLC16A1 antisense RNA 1 [Sour… antise…
3 ENSG00000042832   -0.0915 TG            thyroglobulin [Source:HGNC Sy… protei…
4 ENSG00000165138   -0.0915 ANKS6         ankyrin repeat and sterile al… protei…
5 ENSG00000122378   -0.0740 FAM213A       family with sequence similari… protei…
# ℹ 55 more rows
  • TBKBP1: part of a growth factor signaling axis, already proposed in the literature as a potential mediator of tumor growth (Zhu et al. 2019).
  • SLC16A1-AS1: reported to regulate the cell cycle in oral squamous cell carcinoma (Feng et al. 2020); data also suggest that it contributes to the progression of hepatocellular carcinoma (Duan 2022).
  • TG: encodes thyroglobulin, a precursor of thyroid hormones.
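
Permutation importance can be sketched as follows: shuffle one feature's values across samples, re-score the model, and record the change in MCC. A negative `delta_mcc` means the model relied on that feature. The classifier `toy_predict`, the toy data, and all function names are hypothetical stand-ins for the trained SVM and the real dataset.

```python
import math
import random

def mcc_from_predictions(y_true, y_pred):
    """MCC computed from boolean labels and boolean predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def permutation_delta_mcc(predict, rows, y_true, feature, seed=0):
    """MCC after shuffling one feature column, minus the baseline MCC."""
    baseline = mcc_from_predictions(y_true, [predict(r) for r in rows])
    shuffled = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted_rows = [r[:feature] + [v] + r[feature + 1:]
                     for r, v in zip(rows, shuffled)]
    permuted = mcc_from_predictions(y_true, [predict(r) for r in permuted_rows])
    return permuted - baseline

def toy_predict(row):
    """Hypothetical fixed model that looks only at feature 0."""
    return row[0] > 0.0

rng = random.Random(42)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
y_true = [r[0] > 0.0 for r in rows]

delta_used = permutation_delta_mcc(toy_predict, rows, y_true, feature=0)
delta_unused = permutation_delta_mcc(toy_predict, rows, y_true, feature=1)
```

In this toy setup, shuffling the feature the model ignores leaves the MCC unchanged, while shuffling the feature it depends on can only lower it.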

SVM - Training the Final Model

  • We can still improve the model's predictive power.
  • We have a limited number of samples, so we can conserve them by using 10-fold cross-validation instead of the 80/20 split we’ve been using.
  • A predictor that has seen the input has a 10% vote in this!
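
The fold assignment behind 10-fold cross-validation can be sketched as below. How the final model combines the ten fitted predictors is not specified here, so the sketch only produces the train and held-out index sets; the function name and seed are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

splits = list(k_fold_indices(571, k=10))
fold_sizes = sorted(len(held_out) for _, held_out in splits)
```

Each of the 571 samples is held out exactly once, so every sample contributes to both training and evaluation across the ten folds.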

ROC curve of the final model

License & References

  • Beat AML 2.0 data is used under the CC-BY-4.0 license (Bottomly et al. 2022).
  • Linfa, a Rust machine learning framework, is used under the MIT license.
Bottomly, Daniel, Nicola Long, Anna Reister Schultz, Stephen E. Kurtz, Cristina E. Tognon, Kara Johnson, Melissa Abel, et al. 2022. “Integrative Analysis of Drug Response and Clinical Outcome in Acute Myeloid Leukemia.” Cancer Cell 40 (8): 850–864.e9. https://doi.org/10.1016/j.ccell.2022.07.002.
Duan, Chun. 2022. “LncRNA SLC16A1-AS1 Contributes to the Progression of Hepatocellular Carcinoma Cells by Modulating miR-411/MITD1 Axis.” Journal of Clinical Laboratory Analysis 36 (4): e24344. https://doi.org/10.1002/jcla.24344.
Feng, Hao, Xiaoqi Zhang, Wenli Lai, and Jian Wang. 2020. “Long Non-Coding RNA SLC16A1-AS1: Its Multiple Tumorigenesis Features and Regulatory Role in Cell Cycle in Oral Squamous Cell Carcinoma.” Cell Cycle (Georgetown, Tex.) 19 (13): 1641–53. https://doi.org/10.1080/15384101.2020.1762048.
Kursa, Miron B., and Witold R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (September): 1–13. https://doi.org/10.18637/jss.v036.i11.
Pulte, Dianne, Lina Jansen, Felipe A. Castro, Agne Krilaviciute, Alexander Katalinic, Benjamin Barnes, Meike Ressing, et al. 2016. “Survival in Patients with Acute Myeloblastic Leukemia in Germany and the United States: Major Differences in Survival in Young Adults.” International Journal of Cancer 139 (6): 1289–96. https://doi.org/10.1002/ijc.30186.
Zhu, Lele, Yanchuan Li, Xiaoping Xie, Xiaofei Zhou, Meidi Gu, Zuliang Jie, Chun-Jung Ko, et al. 2019. “TBKBP1 and TBK1 Form a Growth Factor Signaling Axis Mediating Immunosuppression and Tumorigenesis.” Nature Cell Biology 21 (12): 1604–14. https://doi.org/10.1038/s41556-019-0429-8.